What is conditional probability?

Unconditional or marginal probability

Let us say you want to know whether a random day in the year is a winter day, whether it is a snow day, and what the relationship between winter days and snow days is.

We start by drawing a circle, which symbolizes all days in a year:

This circle captures all 4 possible day types: winter days with and without snow and days in the other seasons with and without snow. The probability that a day belongs to any of these day types is 1. Therefore, we say that the area of the circle is also 1.

We know that a quarter of the days in a year are in winter months. We visualize this by highlighting a section of the circle in icy blue.

From this we can see that the probability of winter is 1/4, or 0.25. This is an unconditional (or marginal) probability, and we can write \(P(winter)\) = 0.25.

Snowy days are usually in winter. Therefore, we draw an ellipse for snow days that largely overlaps with the quadrant for winter days.

The complement rule (or subtraction rule) for probabilities says that given the probability of an event \(P(A)\) (snow), the probability that this event does not happen is \(1-P(A)\): \(P(\textrm{not } A) = 1-P(A).\)

The unconditional (or marginal) probability of snow, \(P(snow)\), is the probability of snow regardless of whether it is winter or not, and corresponds to the area of the grey ellipse: \(P(snow) =\) 0.14.
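Applied to the snow probability, the complement rule gives the probability of a day without snow; a quick check in R:

```r
# complement rule: P(not A) = 1 - P(A), using P(snow) = 0.14 from the text
P_snow <- 0.14
P_no_snow <- 1 - P_snow
P_no_snow  # 0.86
```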

Joint probability

The joint probability of winter and snow \(P(winter, snow)\) or \(P(winter \textrm{ and } snow)\) is the ellipse-area that overlaps with the bottom right quadrant, which is shown in the next figure.

The addition rule for probabilities says that the probability that either of two events \(P(A)\) (winter) or \(P(B)\) (snow) happens is the sum of the marginal probabilities \(P(A) + P(B)\) minus the joint probability (snowy winter days): \(P(A \textrm{ or } B) = P(A)+P(B)-P(A \textrm{ and } B)\), where \(P(A \textrm{ and } B)\) is the joint probability.
Rearranging gives the joint probability as \(P(A \textrm{ and } B) = P(A) + P(B) - P(A \textrm{ or } B)\).
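In R, the addition rule and its rearrangement look like this; the value of the joint probability here is an assumption for illustration (it roughly matches the simulation further below):

```r
P_winter <- 0.25
P_snow   <- 0.14
P_winter_and_snow <- 0.12  # assumed joint probability, for illustration

# addition rule: P(A or B) = P(A) + P(B) - P(A and B)
P_winter_or_snow <- P_winter + P_snow - P_winter_and_snow

# rearranged: recover the joint probability from P(A or B)
P_joint <- P_winter + P_snow - P_winter_or_snow
c(P_winter_or_snow, P_joint)  # 0.27 0.12
```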

Conditional probability

A conditional probability gives the answer to the question “Given that it is winter, what is the probability of snow”. This amounts to the question “What is the size of the ellipse-area that overlaps with the bottom right quadrant, relative to the area of the bottom right quadrant”. The next figure visualizes this.

This is an application of the product rule for probabilities.
If we have two events \(A\) (winter) and \(B\) (snow), the joint probability of both events \(P(A,B)\) is calculated by multiplying the unconditional probability \(P(A)\) (winter) with the conditional probability \(P(B|A)\) (the conditional probability of snow given winter). If we rearrange \(P(A,B) = P(A) \cdot P(B|A)\) we get \(P(A,B)/P(A) = P(B|A)\).
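As a small numerical illustration of the product rule (the conditional probability 0.48 is taken as given here; the simulation below arrives at a similar value):

```r
P_winter <- 0.25
P_snow_given_winter <- 0.48  # assumed value of P(snow | winter)

# product rule: P(A, B) = P(A) * P(B | A)
P_winter_and_snow <- P_winter * P_snow_given_winter  # 0.12

# rearranged: P(B | A) = P(A, B) / P(A)
P_winter_and_snow / P_winter  # recovers 0.48
```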

To obtain a conditional probability, we are “conditioning” our question about the probability of snow on the value of another variable, namely that the season is winter. In comparison, the joint probability asks “What is the size of the trimmed ellipse-area, relative to the total circle area of 1”.

The following figures show fractions that make even clearer that the joint probability is the probability of snowy winter days divided by the probability of any day (which is just one), and that the conditional probability is the probability of snowy winter days divided by the probability of winter days.

Joint probability: \(P(A,B) = P(A) \cdot P(B|A)\) or \(P(A,B) = P(B) \cdot P(A|B)\)
Conditional probability: \(P(B|A) = P(A,B)/P(A)\) and \(P(A|B) = P(A,B)/P(B)\).

Note that the probability of a winter day is smaller than 1, and dividing by a number smaller than one makes a number larger. Dividing by the marginal probability of winter therefore ensures that the conditional probability is larger than the joint probability. This makes sense: snow in winter has to be more likely than winter-snow in the whole year.

So, to calculate the conditional probability \(P(snow|winter)\) we want to answer the question “What is the size of the ellipse-area that overlaps with the bottom right quadrant, relative to the area of the bottom right quadrant”. Unfortunately, there is no simple equation for calculating the overlap of an ellipse and a circle quadrant. But we can approximate the correct answer by randomly placing points into the circle and checking if they are in the bottom right quadrant (winter), in the ellipse (snow), or in the area of the ellipse that overlaps with the bottom right quadrant (winter and snow).

Here is the code for the simulation. You don’t need to understand all of it; the important part is that we keep track of winter and snow days in the vectors is.winter and is.snow, respectively.

set.seed(123)
draw_pie()
draw_ellipse()

N = 365
is.winter = vector(length = N) # vector to count winter days
is.snow = vector(length = N) # vector to count snow days

for (k in 1:N) {
  # generate random point with custom function
  xy = rpoint_in_circle()
  
  # check if it is a snow day, i.e. in ellipse, with custom function
  is.snow[k] = in_ellipse(xy,h.e,k.e,a.e,b.e,e.rot)
  # check if it is a winter day
  is.winter[k] = xy[1] > 0 & xy[2] < 0
  
  # plot points
  points(xy[1],xy[2],
         pch = ifelse(is.snow[k],8,21), cex = .75,
         bg = ifelse(is.winter[k],"blue","red"),
         col = ifelse(is.winter[k],"blue","red"))

}

legend(.75,.8,
       pch = c(8,21,15,15), bty = "n",
       col = c("black","black","blue","red"),
       legend = c("snow","no snow", "winter", "no winter"))

Let’s first calculate the probability of winter, which should be around 0.25. This is simply the number of blue dots divided by the total number of dots.

N_winter = sum(is.winter)
P_winter = N_winter/N
P_winter %>% round(2)
## [1] 0.25

Now, the probability of snow (star-shaped dots divided by total number of dots):

N_snow = sum(is.snow)
P_snow = N_snow/N
P_snow %>% round(2)
## [1] 0.15

And now it gets interesting. For the joint probability of winter and snow \(P(winter, snow)\) we count the number of blue stars.

# logical indexing:
# is.snow[is.winter] returns only those entries of the 
# vector is.snow that are at positions where the 
# value for is.winter is TRUE
N_winter_and_snow = sum(is.snow[is.winter]) 
P_winter_and_snow = N_winter_and_snow/N
P_winter_and_snow %>% round(2)
## [1] 0.12

Check the last code block and see that to get the joint probability we divide by the total number of dots N.

In contrast, for conditional probabilities, we want to divide by the number of dots that have the value we are conditioning on. If we want to calculate the conditional probability \(P(snow | winter)\), we therefore have to divide by the number of winter dots:

P_snow_given_winter = N_winter_and_snow/N_winter
P_snow_given_winter %>% round(2)
## [1] 0.48

If you check further above, you can see that P_winter_and_snow and P_winter are calculated by dividing N_winter_and_snow and N_winter, respectively, by N. Therefore, N_winter_and_snow/N_winter and P_winter_and_snow/P_winter give the same result and we can also write:

P_snow_given_winter = P_winter_and_snow/P_winter

Hopefully, you now recognize that we have the conditional probability on the left side (P_snow_given_winter or \(P(snow|winter)\)), which we calculate with the help of the joint probability (P_winter_and_snow or \(P(snow, winter)\)) and the unconditional (marginal) probability (P_winter or \(P(winter)\)) on the right side:

\[ \overset{\color{violet}{\text{conditional probability}}}{P(snow|winter)} = \frac{\overset{\color{red}{\text{joint probability}}}{P(snow, winter)}}{\overset{\color{blue}{\text{marginal probability}}}{P(winter)}} \]

or, more abstractly:

\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} = \frac{\overset{\color{red}{\text{joint probability}}}{P(A, B)}}{\overset{\color{blue}{\text{marginal probability}}}{P(B)}} \] and by multiplying both sides with \(P(B)\), we first get

\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} = \overset{\color{red}{\text{joint probability}}}{P(A,B)} \]

which is the same as

\[ \overset{\color{red}{\text{joint probability}}}{P(A,B)} = \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} \]

This is the general product rule (or chain rule) that connects conditional probabilities with joint probabilities.

Can you also show that the following is true?

\[ P(A, B) = P(B|A) \cdot P(A) \]

Or, using the example above

\[ P(snow, winter) = P(winter|snow) \cdot P(snow) \]
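We can check this with counts in R. The counts below are assumptions, chosen to be consistent with the rounded probabilities from the simulation above:

```r
N <- 365
N_snow <- 54             # assumed count of snow days (about 0.15 of all days)
N_winter_and_snow <- 45  # assumed count of snowy winter days (about 0.12)

P_snow <- N_snow / N
P_winter_given_snow <- N_winter_and_snow / N_snow

# product rule: P(snow, winter) = P(winter | snow) * P(snow)
P_joint <- P_winter_given_snow * P_snow
all.equal(P_joint, N_winter_and_snow / N)  # TRUE: the snow counts cancel
```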

Deriving Bayes rule

Exercise 2E3, which asks which expressions are consistent with “probability of Monday given that it is raining”, can be used to derive Bayes rule.

The correct answers are:

1: P(Monday | rain)

4: P(rain | Monday) * P(Monday) / P(rain)

The question is then whether we can show that

\(P(Monday|rain) = P(rain|Monday) \cdot P(Monday) / P(rain)\)?

The key to the solution is that we can use both \(P(Monday|rain)\) and \(P(rain|Monday)\) to calculate the same thing: the joint probability \(P(Monday,rain)\).

Page 37 shows the relationship of joint and conditional probability:

\[ P(A,B) = \color{red}{P(A|B) \cdot P(B)} \\ P(A,B) = \color{blue}{P(B|A) \cdot P(A)} \]

Therefore, we can say that the joint probability that it is Monday and raining is

\[P(Monday,rain) = \color{blue}{P(rain|Monday) \cdot P(Monday)}\]

or

\[P(Monday,rain) = \color{red}{P(Monday|rain) \cdot P(rain)}\] and we can further say

\[ \color{red}{P(Monday|rain) \cdot P(rain)} = \color{blue}{P(rain|Monday) \cdot P(Monday)} \]

If we want to know what \(P(Monday|rain)\) is, we now divide both sides by \(P(rain)\), which gives us

\[ \color{red}{P(Monday|rain)} = \frac{\color{blue}{P(rain|Monday) \cdot P(Monday)}}{\color{red}{P(rain)}} \]

where the right hand side is answer 4 from above.

Maybe the last equation looks familiar. To make it a bit more recognizable, we can replace \(Monday\) with \(A\) and \(rain\) with \(B\):

\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \] This is Bayes rule, which one uses to calculate the inverse conditional probability, that is, when we have information about the probability of \(B\) given \(A\) and want to calculate the probability of \(A\) given \(B\).
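To make Bayes rule concrete, here is the Monday/rain example with made-up numbers (all three inputs are assumptions, not values from the exercise):

```r
P_Monday <- 1/7              # one day in seven
P_rain <- 0.15               # assumed marginal probability of rain
P_rain_given_Monday <- 0.20  # assumed probability of rain on Mondays

# Bayes rule: P(Monday | rain) = P(rain | Monday) * P(Monday) / P(rain)
P_Monday_given_rain <- P_rain_given_Monday * P_Monday / P_rain
round(P_Monday_given_rain, 2)  # 0.19
```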

What are probability distributions, and where do we need them in Bayesian statistics?

Functions

A probability distribution is a function, that is, an object that receives an input and gives an output. This is very general, and so we can have functions that always return the same value (\(f(x) = 0\)), that square a value (\(f(x) = x^2\)), or that check whether a certain condition is met (\(f(x) = \mathbb{1}(x = 5)\)).1
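The three example functions can be written directly in R:

```r
f_constant <- function(x) 0                   # always returns the same value
f_square   <- function(x) x^2                 # squares the input
f_check    <- function(x) as.numeric(x == 5)  # 1 if the condition is met, else 0

c(f_constant(3), f_square(3), f_check(5), f_check(2))  # 0 9 1 0
```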

In the context of Bayesian statistical analysis, we use probability distributions to describe states of the world about which we are uncertain.

Domain of distributions

For such a function, we first need to be explicit about what we are uncertain about. Continuing with the globe tossing example, we can say that we are uncertain about whether the tip of our index finger will land on water when we catch the globe.

If we were sure to land on water, we would say \(P(water) = 1\), and if we were sure to land on land, we would say \(P(water) = 0\). But because we are uncertain, anything between 0 and 1 is possible. Therefore, our function to describe this uncertainty should allow values between 0 and 1. We can start drawing the function by just specifying an x axis that goes from 0 to 1.

Probability density

Before we display uncertainty, let’s look at how this function looks if we are certain to land on water:

If we are certain to land on water, \(P(water) = 1\)2 and all other probabilities have the value zero.

If we want to express uncertainty, we also have to allow for all other values. If we had no information whatsoever about the probability to land on water, all probabilities should get the same value.

For this function to be a probability distribution, the area under the function (the integral) must be 1.

To see this, we can observe in the next plot that the area under the probability function remains constant while we go from believing weakly (left) to more strongly (right) that the probability to land on water is larger than 0.5.
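We can verify numerically that the area under a density is 1, whatever the parameters; here for two beta distributions (the parameter values are arbitrary choices):

```r
# integrate() passes extra arguments on to the density function
area_flat   <- integrate(dbeta, lower = 0, upper = 1, shape1 = 1, shape2 = 1)$value
area_peaked <- integrate(dbeta, lower = 0, upper = 1, shape1 = 4, shape2 = 2)$value
c(area_flat, area_peaked)  # both (numerically) 1
```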

Probability distributions in R: In R, functions that return the density of a distribution at a given value x start with d. For instance, dbeta(x = 0.5, shape1 = 1, shape2 = 1) returns the density of the value 0.5 under the beta distribution where the parameters shape1 and shape2 both have the value 1. dnorm(x = 0.5, mean = 1, sd = 1) returns the density of the value 0.5 under the normal distribution with a mean of 1 and an sd of 1.
To generate random samples, we use functions that start with r: rnorm(n = 1000, mean = 1, sd = 1) returns 1000 random numbers from a normal distribution with the same parameters.
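A short demonstration of these functions:

```r
dbeta(x = 0.5, shape1 = 1, shape2 = 1)  # density 1: the beta(1, 1) is uniform
dnorm(x = 0.5, mean = 1, sd = 1)        # density of 0.5 under a N(1, 1)

set.seed(123)
samples <- rnorm(n = 1000, mean = 1, sd = 1)
c(mean(samples), sd(samples))  # close to the parameters 1 and 1
```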

To summarize, one can think of a probability distribution as a function that expresses how likely different values of a parameter (here p) are and whose area under the curve (or integral) is 1.

Depending on the nature of a parameter, different probability distributions must be used. Above, we used the so-called beta distribution, because this distribution allows values between 0 and 1, which matches the fact that probabilities need to be between 0 and 1. For other phenomena, different distributions can be used. For instance, we might use a normal distribution to characterize our uncertainty about tomorrow’s temperature, or a Poisson distribution to characterize uncertainty about things we count, such as the number of shoes a person has.

Probability distributions in Bayesian Statistics

In Bayesian statistics, we use such distributions to express three things:

  1. The prior judgement about the probability of different parameter3 values before seeing the data. The parameter \(p\) we introduced above describes the probability to land on water.
  2. The probability of the observed data given different parameter values. This is also called the likelihood.
  3. The posterior probability of different parameter values given our prior judgement and the data.

Let’s walk through a simple example. We start by describing our prior judgement, that we are slightly confident that the index finger touches water rather than land, with a beta distribution:

We use the dbeta function for the prior
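For example, on a grid of possible values for \(p\), a prior that leans slightly toward water could look like this (the shape parameters are an assumption, not necessarily the values used for the figure):

```r
p <- seq(0, 1, length.out = 101)             # grid of possible water probabilities
prior <- dbeta(p, shape1 = 2, shape2 = 1.5)  # assumed shapes, slightly favoring p > 0.5

p[which.max(prior)]  # the prior peaks above 0.5
```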

Next, the likelihood. For the globe tossing example, we can think of each toss as a trial and of each time the index finger lands on water as a success. The distribution that gives the likelihood of different success probabilities \(p\) for a given number of trials and successes is the binomial distribution. So we use this distribution to get the likelihood function. Let us assume we had 4 trials and 3 successes.

The likelihood is just the density of the observed data given different parameter values. Because we count successes out of trials, we calculate the likelihood with the dbinom function (dbinom(x = 3, size = 4, prob = p), where p is a vector with parameter values).
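On the same grid of parameter values, the likelihood for 3 successes in 4 trials is:

```r
p <- seq(0, 1, length.out = 101)
likelihood <- dbinom(x = 3, size = 4, prob = p)

p[which.max(likelihood)]  # peaks at the observed proportion 3/4
```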

Now let us re-introduce Bayes rule, which we described above as:

\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \]

If we just replace \(A\) with \(parameter\) and \(B\) with \(data\) and annotate the different terms we get

\[ \overset{\color{violet}{\text{posterior probability}}}{P(parameter|data)} = \frac{\overset{\color{red}{\text{likelihood}}}{P(data|parameter)} \cdot \overset{\color{blue}{\text{prior probability}}}{P(parameter)}}{\overset{\color{orange}{\text{evidence}}}{P(data)}} \]

This shows us that if we just multiply the likelihood with the prior, two things we calculated and plotted above, we get something that is proportional to the posterior probability of the parameter (the probability to land with the index finger on water) given the data. This is what is meant when you see this expression:

\[ posterior \propto likelihood \cdot prior \]

Let us just calculate and plot this:

prior_x_likelihood = prior * likelihood

This distribution is only proportional to the posterior distribution, because the product of prior and likelihood does not sum up to 1. We can calculate the posterior probability distribution by dividing by the sum.

s = sum(prior_x_likelihood)
posterior = prior_x_likelihood/s
c(1/s, sum(posterior))
## [1] 77.34548  1.00000

The following figure illustrates how we get from the un-normalized posterior to the normalized posterior, which sums to 1, by multiplying with a constant, which is just 1/sum(prior_x_likelihood), i.e. 1/s.

The next plot shows the posterior distribution together with the prior distribution and the likelihood. We are also adding a plot for the un-normalized posterior with a dotted line.

The figure shows that the posterior is a compromise between the prior distribution and the likelihood. That is, it is a compromise between our information before we saw the data and the information that is in the data.

Because we had relatively little data compared to the information in the prior, we can still clearly see the influence of the prior in the posterior. However, if we collect five times the data, we become more certain (the posterior distribution is narrower) and the influence of the prior is diminished so that the posterior will be very similar to the likelihood:
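A sketch of this comparison (the prior shapes are an assumption, and “five times the data” is taken to mean 15 successes in 20 trials):

```r
p <- seq(0, 1, length.out = 101)
prior <- dbeta(p, shape1 = 2, shape2 = 1.5)  # assumed prior

posterior_from <- function(lik) {
  post <- prior * lik
  post / sum(post)  # normalize so the grid values sum to 1
}
post_small <- posterior_from(dbinom(3, size = 4, prob = p))    # original data
post_large <- posterior_from(dbinom(15, size = 20, prob = p))  # five times the data

# the posterior from more data is narrower: its standard deviation is smaller
post_sd <- function(post) sqrt(sum(post * p^2) - sum(post * p)^2)
c(post_sd(post_small), post_sd(post_large))
```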

So it is not so easy to “cheat” with priors to get what one wants, provided one has collected sufficient data of course.


  1. Text due to Tomás Varnet, who suggested to also explain what a function is.↩︎

  2. The figure shows density, not probability↩︎

  3. Parameters are variables that describe characteristics of distributions, like for example the mean and standard deviation of the normal distribution↩︎